Chi-square Tests (Goodness-of-Fit & Independence)#

Chi-square (χ²) tests are hypothesis tests for categorical data. They compare observed counts to expected counts under a null hypothesis.


Learning goals#

  • Know when to use χ² tests (and when not to)

  • Understand where the χ² statistic comes from: “squared standardized residuals”

  • Implement χ² tests from scratch with NumPy

  • Visualize the χ² distribution, p-values, and which cells/categories drive the result

Prerequisites#

  • Basic probability and hypotheses (H0/H1, p-values, α)

  • Categorical variables, contingency tables

  • Comfort with NumPy arrays

When to use#

Use χ² tests when your data are counts in categories (or can be binned into categories).

Common variants:

  1. Goodness-of-fit: do observed category counts match a claimed distribution?

  2. Independence / homogeneity: are two categorical variables associated?

When NOT to use#

  • Continuous data (use t-tests/ANOVA, nonparametric tests, etc.)

  • Very small expected counts (consider combining categories or using exact tests, e.g. Fisher's exact test for 2×2 tables)
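The exact-test fallback mentioned above can be sketched from scratch, in the spirit of this notebook. Under H0 with both margins fixed, the top-left cell of a 2×2 table follows a hypergeometric distribution, and the two-sided p-value sums the probabilities of every table no more likely than the observed one. The counts below are hypothetical; for real work, prefer `scipy.stats.fisher_exact`.

```python
import math

def fisher_exact_2x2(table):
    '''Two-sided Fisher exact p-value for a 2x2 table of counts.'''
    (a, b), (c, d) = table
    n = a + b + c + d
    row1, col1 = a + b, a + c

    def hyper_p(x):
        # P(top-left cell = x) when both margins are fixed (hypergeometric)
        return (math.comb(col1, x) * math.comb(n - col1, row1 - x)
                / math.comb(n, row1))

    lo, hi = max(0, row1 + col1 - n), min(row1, col1)
    p_obs = hyper_p(a)
    # two-sided: include every table no more likely than the observed one
    return sum(p for x in range(lo, hi + 1)
               if (p := hyper_p(x)) <= p_obs * (1 + 1e-12))

print(fisher_exact_2x2([[3, 7], [8, 2]]))  # exact two-sided p-value
```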

The hypotheses#

1) Goodness-of-fit (one categorical variable)#

  • H0: category probabilities are p1,…,pk (given by theory/business claim)

  • H1: at least one probability differs

2) Independence (two categorical variables)#

  • H0: X and Y are independent (no association)

  • H1: X and Y are associated

The χ² statistic (Pearson’s X²): “sum of squared standardized residuals”#

For each category/cell:

  • Observed count: O

  • Expected count under H0: E

The standardized residual is roughly:

r = (O − E) / √E

Pearson’s chi-square statistic is:

X² = Σ (O − E)² / E = Σ r²

So X² gets large when observed counts differ from expected counts by more than “random fluctuation” would suggest.
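As a tiny numeric illustration of the identity above (the counts here are made up):

```python
import numpy as np

O = np.array([18, 25, 17], dtype=float)  # hypothetical observed counts
E = np.array([20.0, 20.0, 20.0])         # expected counts under H0

r = (O - E) / np.sqrt(E)                 # standardized residuals
X2 = float(np.sum((O - E) ** 2 / E))     # Pearson's X²

print('residuals:', r)
print('X²:', X2)  # ~1.9, identical to sum(r**2)
```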

import math
import os
import sys

import numpy as np
import plotly
import plotly.express as px
import plotly.graph_objects as go
import plotly.io as pio

pio.templates.default = 'plotly_white'
pio.renderers.default = os.environ.get('PLOTLY_RENDERER', 'notebook')

np.set_printoptions(precision=4, suppress=True)
rng = np.random.default_rng(42)

print('python:', sys.version.split()[0])
print('numpy:', np.__version__)
print('plotly:', plotly.__version__)
python: 3.12.9
numpy: 1.26.2
plotly: 6.5.2

Why “chi-square”?#

If Z1,…,Zk are independent standard normals, then:

Z1² + … + Zk² ~ χ²(k)

That connection is why the χ² distribution appears when we add up many squared (approximately normal) residual-like terms.
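A quick Monte Carlo sanity check of this fact via the first two moments: the mean of χ²(k) is k and the variance is 2k. (The seed here is arbitrary and independent of the notebook's `rng`.)

```python
import numpy as np

rng_check = np.random.default_rng(0)  # separate seed, just for this check
k = 6
sums = np.sum(rng_check.standard_normal((100_000, k)) ** 2, axis=1)

print('sample mean:', sums.mean())  # should be close to k = 6
print('sample var: ', sums.var())   # should be close to 2k = 12
```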

# --- Numeric building blocks: log Γ, regularized incomplete gamma, χ² pdf/cdf/sf ---
# We implement the χ² p-value using the identity:
#   If X ~ χ²(df), then P(X >= x) = Q(df/2, x/2)
# where Q is the regularized upper incomplete gamma.

_LANCZOS_COEFFS = np.array(
    [
        0.99999999999980993,
        676.5203681218851,
        -1259.1392167224028,
        771.3234287776531,
        -176.6150291621406,
        12.507343278686905,
        -0.13857109526572012,
        9.984369578019572e-6,
        1.5056327351493116e-7,
    ],
    dtype=float,
)
_LANCZOS_G = 7


def log_gamma(z: float) -> float:
    '''log(Γ(z)) using a Lanczos approximation (real z).'''
    if z <= 0 and float(z).is_integer():
        return float('inf')  # pole
    if z < 0.5:
        # Reflection formula: Γ(z)Γ(1-z) = π / sin(πz)
        return math.log(math.pi) - math.log(abs(math.sin(math.pi * z))) - log_gamma(1 - z)

    z_minus_1 = z - 1.0
    x = float(_LANCZOS_COEFFS[0])
    for i in range(1, len(_LANCZOS_COEFFS)):
        x += float(_LANCZOS_COEFFS[i]) / (z_minus_1 + i)
    t = z_minus_1 + _LANCZOS_G + 0.5
    return 0.5 * math.log(2 * math.pi) + (z_minus_1 + 0.5) * math.log(t) - t + math.log(x)


def _gamma_p_series(a: float, x: float, eps: float = 1e-14, max_iter: int = 10_000) -> float:
    '''Lower regularized gamma P(a,x) via series (stable for x < a+1).'''
    if x == 0:
        return 0.0
    ap = a
    term = 1.0 / a
    summation = term
    for _ in range(max_iter):
        ap += 1.0
        term *= x / ap
        summation += term
        if abs(term) < abs(summation) * eps:
            break
    return summation * math.exp(-x + a * math.log(x) - log_gamma(a))


def _gamma_q_contfrac(a: float, x: float, eps: float = 1e-14, max_iter: int = 10_000) -> float:
    '''Upper regularized gamma Q(a,x) via continued fraction (stable for x >= a+1).'''
    fpmin = 1e-300
    b = x + 1.0 - a
    c = 1.0 / fpmin
    d = 1.0 / max(b, fpmin)
    h = d

    for i in range(1, max_iter + 1):
        an = -i * (i - a)
        b += 2.0

        d = an * d + b
        if abs(d) < fpmin:
            d = fpmin

        c = b + an / c
        if abs(c) < fpmin:
            c = fpmin

        d = 1.0 / d
        delta = d * c
        h *= delta

        if abs(delta - 1.0) < eps:
            break

    return math.exp(-x + a * math.log(x) - log_gamma(a)) * h


def gamma_p(a: float, x: float) -> float:
    '''Lower regularized incomplete gamma P(a,x).'''
    if x < 0 or a <= 0:
        raise ValueError('gamma_p requires a>0 and x>=0')
    if x == 0:
        return 0.0

    if x < a + 1.0:
        p = _gamma_p_series(a, x)
    else:
        # P = 1 - Q
        p = 1.0 - _gamma_q_contfrac(a, x)

    return float(min(1.0, max(0.0, p)))


def gamma_q(a: float, x: float) -> float:
    '''Upper regularized incomplete gamma Q(a,x) = 1 - P(a,x).'''
    if x < 0 or a <= 0:
        raise ValueError('gamma_q requires a>0 and x>=0')
    if x == 0:
        return 1.0

    if x < a + 1.0:
        q = 1.0 - _gamma_p_series(a, x)
    else:
        q = _gamma_q_contfrac(a, x)

    return float(min(1.0, max(0.0, q)))


def chi2_pdf(x, df: int):
    x = np.asarray(x, dtype=float)
    x = np.maximum(x, 0.0)
    k = float(df)
    a = 0.5 * k
    # log f(x) = -a log(2) - log Γ(a) + (a-1)log(x) - x/2
    log_coeff = -a * math.log(2.0) - log_gamma(a)
    with np.errstate(divide='ignore'):
        log_pdf = log_coeff + (a - 1.0) * np.log(x) - 0.5 * x
    pdf = np.exp(log_pdf)
    return pdf


def chi2_cdf(x: float, df: int) -> float:
    return gamma_p(0.5 * df, 0.5 * x)


def chi2_sf(x: float, df: int) -> float:
    return gamma_q(0.5 * df, 0.5 * x)
df_demo = 6
n_sims = 60_000

chi2_sim = np.sum(rng.standard_normal((n_sims, df_demo)) ** 2, axis=1)

x_grid = np.linspace(0.001, np.percentile(chi2_sim, 99.5), 400)
pdf_grid = chi2_pdf(x_grid, df_demo)

fig = go.Figure()
fig.add_trace(
    go.Histogram(
        x=chi2_sim,
        nbinsx=60,
        histnorm='probability density',
        name='simulation (sum of squares)',
        opacity=0.65,
    )
)
fig.add_trace(go.Scatter(x=x_grid, y=pdf_grid, mode='lines', name=f'χ² pdf (df={df_demo})'))
fig.update_layout(
    title='A χ² distribution as a sum of squared standard normals',
    xaxis_title='x',
    yaxis_title='density',
    bargap=0.02,
)
fig.show()

P-values: how extreme is X²?#

For a computed statistic x_obs and degrees of freedom df:

p-value = P(χ²(df) ≥ x_obs)

Small p-values mean: “this much discrepancy (or more) would be rare if H0 were true”.

df = 5
x_obs = 11.1
p_val = chi2_sf(x_obs, df)

x = np.linspace(0.001, max(30, x_obs * 1.4), 600)
pdf = chi2_pdf(x, df)

tail_mask = x >= x_obs

fig = go.Figure()
fig.add_trace(go.Scatter(x=x, y=pdf, mode='lines', name=f'χ² pdf (df={df})'))
fig.add_trace(
    go.Scatter(
        x=np.concatenate([x[tail_mask], x[tail_mask][::-1]]),
        y=np.concatenate([pdf[tail_mask], np.zeros_like(pdf[tail_mask])]),
        fill='toself',
        fillcolor='rgba(214,39,40,0.25)',
        line=dict(color='rgba(0,0,0,0)'),
        name=f'p-value area = {p_val:.4f}',
    )
)
fig.add_vline(x=x_obs, line_dash='dash', line_color='firebrick')
fig.update_layout(title='Right-tail p-value for a χ² test', xaxis_title='χ² value', yaxis_title='density')
fig.show()

print(f'x_obs={x_obs:.3f}, df={df}, p-value={p_val:.6f}')
x_obs=11.100, df=5, p-value=0.049433

From scratch: χ² test helpers (NumPy only)#

We’ll implement:

  • χ² statistic

  • goodness-of-fit test

  • independence test (contingency table)

  • diagnostics: contributions, standardized residuals, Cramér’s V

def chi_square_statistic(observed, expected) -> float:
    observed = np.asarray(observed, dtype=float)
    expected = np.asarray(expected, dtype=float)
    if observed.shape != expected.shape:
        raise ValueError('observed and expected must have the same shape')
    if np.any(expected <= 0):
        raise ValueError('expected counts must be positive')
    return float(np.sum((observed - expected) ** 2 / expected))


def chi_square_contributions(observed, expected):
    observed = np.asarray(observed, dtype=float)
    expected = np.asarray(expected, dtype=float)
    return (observed - expected) ** 2 / expected


def standardized_residuals(observed, expected):
    observed = np.asarray(observed, dtype=float)
    expected = np.asarray(expected, dtype=float)
    return (observed - expected) / np.sqrt(expected)


def chi_square_gof(observed, expected_probs=None, expected_counts=None, ddof: int = 0):
    '''Goodness-of-fit test for 1D category counts.

    ddof: how many parameters were estimated from the data to form expected_probs.
          (df = k - 1 - ddof)
    '''
    observed = np.asarray(observed, dtype=float)
    if observed.ndim != 1:
        raise ValueError('observed must be 1D category counts')
    n = float(observed.sum())
    if n <= 0:
        raise ValueError('total count must be positive')

    if expected_counts is None:
        if expected_probs is None:
            raise ValueError('provide expected_probs or expected_counts')
        expected_probs = np.asarray(expected_probs, dtype=float)
        if expected_probs.shape != observed.shape:
            raise ValueError('expected_probs must match observed')
        if np.any(expected_probs < 0):
            raise ValueError('expected_probs must be non-negative')
        if not np.isclose(expected_probs.sum(), 1.0):
            raise ValueError('expected_probs must sum to 1')
        expected_counts = n * expected_probs
    else:
        expected_counts = np.asarray(expected_counts, dtype=float)
        if expected_counts.shape != observed.shape:
            raise ValueError('expected_counts must match observed')

    stat = chi_square_statistic(observed, expected_counts)
    df = int(observed.size - 1 - ddof)
    if df <= 0:
        raise ValueError('degrees of freedom must be positive; check ddof')
    p_value = chi2_sf(stat, df)
    return dict(statistic=stat, df=df, p_value=p_value, expected=expected_counts)


def expected_counts_independence(table):
    observed = np.asarray(table, dtype=float)
    if observed.ndim != 2:
        raise ValueError('table must be 2D (rows x columns)')
    n = float(observed.sum())
    if n <= 0:
        raise ValueError('table must have positive total count')
    row_sums = observed.sum(axis=1, keepdims=True)
    col_sums = observed.sum(axis=0, keepdims=True)
    return row_sums @ col_sums / n


def chi_square_independence(table):
    observed = np.asarray(table, dtype=float)
    expected = expected_counts_independence(observed)
    stat = chi_square_statistic(observed, expected)
    r, c = observed.shape
    df = int((r - 1) * (c - 1))
    p_value = chi2_sf(stat, df)
    return dict(statistic=stat, df=df, p_value=p_value, expected=expected)


def cramers_v(chi2_stat: float, n: float, r: int, c: int) -> float:
    denom = n * min(r - 1, c - 1)
    if denom <= 0:
        return float('nan')
    return math.sqrt(chi2_stat / denom)

Example A: Goodness-of-fit (is a die fair?)#

Suppose we roll a 6-sided die n times.

  • H0: p(face=i) = 1/6 for i=1..6 (fair die)

  • H1: the probabilities are not all 1/6

faces = np.arange(1, 7)
observed = np.array([15, 23, 16, 19, 24, 23])

expected_probs = np.full(observed.size, 1 / observed.size)
gof = chi_square_gof(observed, expected_probs=expected_probs)
gof
{'statistic': 3.8,
 'df': 5,
 'p_value': 0.5785552914362746,
 'expected': array([20., 20., 20., 20., 20., 20.])}
expected = gof['expected']
contrib = chi_square_contributions(observed, expected)

fig = go.Figure()
fig.add_trace(go.Bar(x=faces, y=observed, name='observed'))
fig.add_trace(go.Bar(x=faces, y=expected, name='expected (H0)'))
fig.update_layout(
    barmode='group',
    title='Die rolls: observed vs expected counts',
    xaxis_title='face',
    yaxis_title='count',
)
fig.show()

fig = go.Figure(go.Bar(x=faces, y=contrib))
fig.update_layout(
    title='Per-category contribution to χ² statistic',
    xaxis_title='face',
    yaxis_title='(O−E)²/E',
)
fig.show()

print('χ² statistic:', gof['statistic'])
print('df:', gof['df'])
print('p-value (χ² approx):', gof['p_value'])
print('sum of contributions:', float(contrib.sum()))
χ² statistic: 3.8
df: 5
p-value (χ² approx): 0.5785552914362746
sum of contributions: 3.8

Interpreting the goodness-of-fit result#

  • If the p-value is small (≤ α), the data are inconsistent with the claimed probabilities.

  • If the p-value is not small, you do not prove H0; you only say the data are compatible with H0.

Also remember:

  • χ² tests become very sensitive with large n: even practically negligible deviations can produce tiny p-values.

  • A single p-value won’t tell you which categories are driving the mismatch (use residuals/contributions).
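A minimal sketch of the large-n sensitivity, reusing the die counts from Example A: scaling every count by a factor s leaves the observed proportions unchanged but multiplies X² by s, so with df fixed the p-value collapses even though the effect is no larger.

```python
import numpy as np

def gof_stat(observed, probs):
    observed = np.asarray(observed, dtype=float)
    expected = observed.sum() * np.asarray(probs, dtype=float)
    return float(np.sum((observed - expected) ** 2 / expected))

probs = np.full(6, 1 / 6)
base = np.array([15, 23, 16, 19, 24, 23], dtype=float)  # Example A counts

for scale in (1, 10, 100):
    stat = gof_stat(base * scale, probs)
    print(f'n = {int(base.sum()) * scale:>5}, X² = {stat:.1f}')
```

At df = 5, the statistic of 38 reached after a tenfold scaling already lies far beyond any conventional critical value, while the observed proportions are identical to the original (insignificant) sample.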

# Monte Carlo view of the null distribution (GOF)
n_sims = 50_000
n = int(observed.sum())

sim_counts = rng.multinomial(n, expected_probs, size=n_sims)
expected_counts = n * expected_probs
sim_stats = np.sum((sim_counts - expected_counts) ** 2 / expected_counts, axis=1)

p_mc = (np.sum(sim_stats >= gof['statistic']) + 1) / (n_sims + 1)

x_grid = np.linspace(0.001, np.percentile(sim_stats, 99.7), 400)
pdf_grid = chi2_pdf(x_grid, gof['df'])

fig = go.Figure()
fig.add_trace(
    go.Histogram(
        x=sim_stats,
        nbinsx=70,
        histnorm='probability density',
        name='Monte Carlo (H0)',
        opacity=0.65,
    )
)
fig.add_trace(go.Scatter(x=x_grid, y=pdf_grid, mode='lines', name='χ² approximation'))
fig.add_vline(x=gof['statistic'], line_dash='dash', line_color='firebrick')
fig.update_layout(
    title='Null distribution of χ² statistic (GOF): simulation vs χ² approximation',
    xaxis_title='χ² statistic',
    yaxis_title='density',
)
fig.show()

print('Analytic p-value (χ² approx):', gof['p_value'])
print('Monte Carlo p-value:', p_mc)
Analytic p-value (χ² approx): 0.5785552914362746
Monte Carlo p-value: 0.5857282854342913

Which categories drive the discrepancy?#

The overall p-value is one number. To see where the discrepancy lives, look at:

  • standardized residuals r = (O−E)/√E

  • contributions (O−E)²/E

Large |r| means that category differs from expectation more than random fluctuation would suggest.

resid = standardized_residuals(observed, expected)

fig = go.Figure(go.Bar(x=faces, y=resid))
fig.update_layout(
    title='Standardized residuals per face',
    xaxis_title='face',
    yaxis_title='(O−E)/√E',
)
fig.add_hline(y=2, line_dash='dot', line_color='gray')
fig.add_hline(y=-2, line_dash='dot', line_color='gray')
fig.show()

resid
array([-1.118 ,  0.6708, -0.8944, -0.2236,  0.8944,  0.6708])

Example B: Test of independence (contingency table)#

We observe counts for two categorical variables.

  • H0: the variables are independent (joint probability = product of marginals)

  • Expected cell count under H0:

Eᵢⱼ = (row_totalᵢ × col_totalⱼ) / n

  • χ² statistic sums (Oᵢⱼ − Eᵢⱼ)² / Eᵢⱼ over all cells

  • df = (r−1)(c−1)

row_labels = np.array(['New', 'Returning'])
col_labels = np.array(['Chat', 'Email', 'Phone'])

table = np.array(
    [
        [50, 30, 20],
        [30, 45, 25],
    ]
)

ind = chi_square_independence(table)
expected_ind = ind['expected']
resid_ind = standardized_residuals(table, expected_ind)
contrib_ind = chi_square_contributions(table, expected_ind)

n = float(table.sum())
v = cramers_v(ind['statistic'], n, *table.shape)

print('χ² statistic:', ind['statistic'])
print('df:', ind['df'])
print('p-value (χ² approx):', ind['p_value'])
print("Cramér's V:", v)
χ² statistic: 8.555555555555557
df: 2
p-value (χ² approx): 0.01387345776252751
Cramér's V: 0.20682789409984761
def annotated_heatmap(z, x_labels, y_labels, title, fmt='{:.1f}', colorscale='Blues', zmid=None):
    z = np.asarray(z)
    text = [[fmt.format(v) for v in row] for row in z]
    heatmap_kwargs = dict(
        z=z,
        x=list(x_labels),
        y=list(y_labels),
        text=text,
        texttemplate='%{text}',
        colorscale=colorscale,
        colorbar=dict(title='value'),
    )
    if zmid is not None:
        heatmap_kwargs['zmid'] = zmid
    fig = go.Figure(go.Heatmap(**heatmap_kwargs))
    fig.update_layout(title=title, xaxis_title='column category', yaxis_title='row category')
    return fig


annotated_heatmap(table, col_labels, row_labels, title='Observed counts', fmt='{:.0f}').show()
annotated_heatmap(expected_ind, col_labels, row_labels, title='Expected counts under independence (H0)').show()
annotated_heatmap(resid_ind, col_labels, row_labels, title='Standardized residuals (direction + strength)', fmt='{:.2f}', colorscale='RdBu', zmid=0).show()
annotated_heatmap(contrib_ind, col_labels, row_labels, title='Contributions to χ² (where the test is "spent")', fmt='{:.3f}', colorscale='Reds').show()

Interpreting the independence test#

  • If p-value ≤ α: evidence that variables are associated (not independent).

  • This does not prove causality — only association.

  • Standardized residuals tell direction (positive = more than expected, negative = less).

  • Report an effect size like Cramér’s V (especially when n is large).

# Monte Carlo view of the null distribution (independence)
# We simulate tables under an independence model, using plug-in estimates
# of the marginal probabilities.
n_sims = 50_000
n_int = int(n)

row_probs = table.sum(axis=1) / n
col_probs = table.sum(axis=0) / n
cell_probs = np.outer(row_probs, col_probs).ravel()

sim_flat = rng.multinomial(n_int, cell_probs, size=n_sims)
sim_tables = sim_flat.reshape(n_sims, table.shape[0], table.shape[1])

# Compute the χ² statistic the same way as the test does: expected from margins of each simulated table.
row_sums = sim_tables.sum(axis=2, keepdims=True)
col_sums = sim_tables.sum(axis=1, keepdims=True)
expected_sim = row_sums * col_sums / n
sim_stats = np.sum((sim_tables - expected_sim) ** 2 / expected_sim, axis=(1, 2))

p_mc = (np.sum(sim_stats >= ind['statistic']) + 1) / (n_sims + 1)

x_grid = np.linspace(0.001, np.percentile(sim_stats, 99.7), 400)
pdf_grid = chi2_pdf(x_grid, ind['df'])

fig = go.Figure()
fig.add_trace(
    go.Histogram(
        x=sim_stats,
        nbinsx=70,
        histnorm='probability density',
        name='Monte Carlo (H0)',
        opacity=0.65,
    )
)
fig.add_trace(go.Scatter(x=x_grid, y=pdf_grid, mode='lines', name='χ² approximation'))
fig.add_vline(x=ind['statistic'], line_dash='dash', line_color='firebrick')
fig.update_layout(
    title='Null distribution (independence): simulation vs χ² approximation',
    xaxis_title='χ² statistic',
    yaxis_title='density',
)
fig.show()

print('Analytic p-value (χ² approx):', ind['p_value'])
print('Monte Carlo p-value:', p_mc)
print("Cramér's V:", v)
Analytic p-value (χ² approx): 0.01387345776252751
Monte Carlo p-value: 0.013919721605567889
Cramér's V: 0.20682789409984761

Assumptions and pitfalls#

  1. Independent observations: counts come from independent draws (no repeated-measures unless modeled)

  2. Adequate expected counts: a common rule is all Eᵢⱼ ≥ 5 (or at least 80% ≥ 5 and none < 1)

  3. Categories are mutually exclusive and collectively exhaustive

  4. Large n makes the χ² approximation better — but also makes tiny differences “significant”

Diagnostics / good practice:

  • Inspect expected counts; merge sparse categories if needed

  • Look at contributions and residuals, not only the p-value

  • Report an effect size (Cramér’s V) and context

  • Beware multiple testing if you run many χ² tests
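The expected-count rule of thumb can be wrapped in a small checker. This helper is illustrative and not part of the API built earlier; the first table below is the expected table from Example B, the second is hypothetical.

```python
import numpy as np

def check_expected_counts(expected, min_ok=5.0, min_frac=0.8, hard_floor=1.0):
    '''Flag tables violating the rule of thumb: at least `min_frac` of cells
    should have expected count >= min_ok, and none below hard_floor.'''
    expected = np.asarray(expected, dtype=float)
    frac_ok = float(np.mean(expected >= min_ok))
    return dict(
        min_expected=float(expected.min()),
        frac_at_least_5=frac_ok,
        rule_ok=bool(frac_ok >= min_frac and expected.min() >= hard_floor),
    )

print(check_expected_counts([[40.0, 37.5, 22.5], [40.0, 37.5, 22.5]]))  # rule_ok True
print(check_expected_counts([[8.0, 0.7], [3.2, 4.1]]))                  # rule_ok False
```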

Practical usage (SciPy, optional)#

SciPy provides:

  • scipy.stats.chisquare for goodness-of-fit

  • scipy.stats.chi2_contingency for independence (with optional Yates correction for 2×2)

The NumPy implementation above is for learning; SciPy is recommended for production.

try:
    from scipy.stats import chi2_contingency, chisquare

    print('SciPy available:', True)

    print('\nGOF via scipy.stats.chisquare:')
    print(chisquare(f_obs=observed, f_exp=gof['expected']))

    print('\nIndependence via scipy.stats.chi2_contingency:')
    chi2_stat, p, dof, exp = chi2_contingency(table, correction=False)
    print({'statistic': chi2_stat, 'p_value': p, 'df': dof})
except Exception as e:
    print('SciPy not available (or import failed):', repr(e))
SciPy available: True

GOF via scipy.stats.chisquare:
Power_divergenceResult(statistic=3.8, pvalue=0.5785552914362739)

Independence via scipy.stats.chi2_contingency:
{'statistic': 8.555555555555557, 'p_value': 0.0138734577625275, 'df': 2}

Exercises#

  1. Goodness-of-fit: simulate an unfair die and see how sample size affects p-value.

  2. Independence: create a contingency table where the association is weak but n is huge; compare p-value vs Cramér’s V.

  3. Check the expected-count rule: what happens when some expected counts are < 5?

  4. Implement Yates’ correction for a 2×2 table and compare.

References#

  • Pearson’s chi-square test (core statistic X² = Σ (O−E)²/E)

  • Any intro stats text: expected counts, df formulas, and assumptions

  • SciPy docs: scipy.stats.chisquare, scipy.stats.chi2_contingency